Towards Efficient Positional Inverted Index

Authors

  • Petr Procházka
  • Jan Holub
Abstract

We address the problem of positional indexing in the natural language domain. A positional inverted index records the positions at which each word occurs, so it can reconstruct the original text file, which means the original file need not be stored. Our Positional Inverted Self-Index (PISI) stores the word position gaps encoded with variable byte code. The inverted lists of single terms are combined into one inverted list that forms the backbone of the text file, since it stores the sequence of the indexed words of the original file. This inverted list is synchronized with a presentation layer that stores separators, stop words, and variants of the indexed words. The presentation layer is encoded with Huffman coding. The space complexity of the PISI inverted list is $O\left((N - n)\lceil \log_{2^b} N \rceil + \left(\left\lfloor \frac{N - n}{\alpha} \right\rfloor + n\right) \times \left(\lceil \log_{2^b} n \rceil + 1\right)\right)$, where $N$ is the number of stems, $n$ is the number of unique stems, $\alpha$ is the step (period) of the back pointers in the inverted list, and $b$ is the size of a machine word in bits. The space complexity of the presentation layer is $O\left(-\sum_{i=1}^{N} \lceil \log_2 p_i^{n(i)} \rceil - \sum_{j=1}^{N'} \lceil \log_2 p_j \rceil + N\right)$, where $p_i^{n(i)}$ is the probability of the stem variant at position $i$, $p_j$ is the probability of the separator or stop word at position $j$, and $N'$ is the number of separators and stop words.
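As a rough illustration of the gap encoding mentioned in the abstract, the sketch below stores a sorted list of word positions as gaps and writes each gap with a variable byte code. The abstract does not specify PISI's exact byte layout, so the convention used here (7 payload bits per byte, low-order group first, high bit set on every byte except the last) is an assumption, and the function names `vbyte_encode`, `encode_positions`, and `decode_positions` are hypothetical.

```python
from typing import Iterable, List


def vbyte_encode(value: int) -> bytes:
    """Encode one non-negative integer with a variable byte code.

    Assumed layout: 7 payload bits per byte, low-order group first;
    the high bit marks "more bytes follow".
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # continuation byte
        value >>= 7
    out.append(value)  # final byte, high bit clear
    return bytes(out)


def encode_positions(positions: Iterable[int]) -> bytes:
    """Store a sorted position list as variable-byte encoded gaps."""
    out = bytearray()
    previous = 0
    for pos in positions:
        if pos < previous:
            raise ValueError("positions must be sorted in ascending order")
        out += vbyte_encode(pos - previous)  # gap to the previous position
        previous = pos
    return bytes(out)


def decode_positions(data: bytes) -> List[int]:
    """Rebuild absolute positions by summing the decoded gaps."""
    positions: List[int] = []
    value = 0
    shift = 0
    previous = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # more bytes belong to this gap
        else:
            previous += value
            positions.append(previous)
            value, shift = 0, 0
    return positions
```

Small gaps fit in a single byte, which is why storing gaps rather than absolute positions tends to pay off when the occurrences of a word lie close together.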

Similar resources

Positional Data Organization and Compression in Web Inverted Indexes

To sustain the tremendous workloads they suffer on a daily basis, Web search engines employ highly compressed data structures known as inverted indexes. Previous works demonstrated that organizing the inverted lists of the index in individual blocks of postings leads to significant efficiency improvements. Moreover, the recent literature has shown that the current state-of-the-art compression s...

Phrase Queries with Inverted + Direct Indexes

Phrase queries play an important role in web search and other applications. Traditionally, phrase queries have been processed using a positional inverted index, potentially augmented by selected multiword sequences (e.g., n-grams or frequent noun phrases). In this work, instead of augmenting the inverted index, we take a radically different approach and leverage the direct index, which provides...

Intra-Positional and Inter-Positional Differences in Somatotype Components and Proportions of Particular Somatotype Categories in Youth Volleyball Players

Objective(s). Main aim of this cross-sectional study was to analyse intra-positional, inter-positional differences in proportions of particular somatotype categories in youth volleyball players. Methods. Heath-Carter method was used to determine somatotype characteristics of 181 young female volleyball players (age 14.05±0.93, height 170.03±7.61 cm, mass 57.80±8.59 kg, bod...

Towards Efficient SPARQL Query Processing on RDF Data

Efficient support for querying large-scale RDF triples plays an important role in Semantic Web data management. This paper proposes an efficient RDF query engine to evaluate SPARQL queries, where the inverted index structure is employed for indexing RDF triples. We first design and implement a set of operators on the inverted index for query optimization and evaluation. Then we propose a main-t...

Window Extraction for Information Retrieval

Proximity-based term dependencies have been proposed and used in a variety of effective retrieval models. The execution of these dependency models is commonly supported through the use of positional inverted indexes. However, few of these models detail how instances of proximate terms should be extracted from the lists of positional data. In this study, we investigate three algorithms for the e...

Journal:
  • Algorithms

Volume 10

Publication date: 2017